Red wine analysis by Kort Linden

This data set covers the chemical makeup and quality test scores of red wines. I am setting out to find which chemical attributes show a relationship to taste test scores, and how strongly the attributes correlate with one another.

Citation

Citation #1 http://rpubs.com/Daria/57835 Red and White Wine Quality by Daria Alekseeva

Citation #2 This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

ABOUT THE DATA SET

For more information about this data set, please see wineQualityinfo.txt

Some descriptive statistics and basic information

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
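
The output above matches what R's dim(), str(), summary(), head(), and names() print for a data frame. The sketch below shows those calls under stated assumptions: the original file name and loading code are not given in this report, so the data frame `wine` (a name chosen here for illustration) is rebuilt from inline CSV text holding only the six rows that head() displays, not the full 1599 rows.

```r
# Assumption: the full data would normally be read from a CSV file with read.csv().
# Here only the six rows shown by head() are reproduced as inline text.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"

wine <- read.csv(text = csv_text)

dim(wine)      # number of rows and columns
str(wine)      # type and first values of each variable
summary(wine)  # min, quartiles, mean, max per variable
head(wine)     # first six rows
names(wine)    # variable names
```

With the full file, the same five calls would be expected to give the 1599 x 13 dimensions and the quartiles listed above.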

Summary of data

There are 1599 observations and 13 variables in this set. The variable names are: X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality.

Some initial thoughts about the data

We can imagine that pH, the acidity measures, and alcohol might be naturally related. X looks to be just a row identifier and is not needed. The highest alcohol content is 14.9 percent and the lowest is 8.4 percent. The median quality score is 6, with a mean of 5.636 and a max of 8. The lowest quality score was a 3, yuck!
I would also bet that sulfur dioxide is negatively related to quality score, since sulfites are often found in lower quality wines as far as I know.
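
Since X only numbers the rows, it can be dropped before any further analysis. A minimal sketch, assuming a data frame named `wine` (a name chosen here; only three of the thirteen columns are reproduced, using the six rows printed by head()):

```r
# Small stand-in for the full data frame: the identifier plus two real columns.
wine <- data.frame(
  X       = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Remove the row identifier so it cannot show up in plots or correlations.
wine$X <- NULL

names(wine)          # "alcohol" "quality"
range(wine$alcohol)  # lowest and highest alcohol content in these rows
```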

Univariate Plots Section

In the following twelve graphs, I am taking a first look at the variables. I will mostly be looking for abnormal distributions and outliers.

Univariate Analysis Summary

The distributions that did not look normal were: Alcohol, Residual Sugar, Free Sulfur Dioxide, Sulphates, Volatile Acidity, Citric Acid, Chlorides, and Total Sulfur Dioxide. The most common issue was a right skew (in the summary statistics, each of these has a mean above its median and a maximum far beyond the third quartile), followed by long tails, and some had extreme outliers.
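
One way to check the direction of a skew numerically, rather than by eye, is to compare the mean with the median and compute a moment-based skewness. The sketch below is an illustration only: `skewness` is a helper defined here (not a base R function), and the input is just the ten residual.sugar values printed by str(), so the numbers will differ from the full sample.

```r
# Moment-based sample skewness: positive means a long right tail,
# negative means a long left tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten residual.sugar values, taken from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

mean(residual_sugar) > median(residual_sugar)  # TRUE when the tail is on the right

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))    # a log scale pulls in the long tail
c(raw = skew_raw, log10 = skew_log)
```

A log10 transform is a common option for plotting variables like this, though whether it is appropriate for each variable here is a judgment call.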

Bivariate Plots Section

Bivariate Analysis Summary

I used ggpairs to analyze the correlations and distributions of the variables. The scatter plots varied, but for the main variable I am interested in, “Quality”, the points fall in stripes because the scores are whole numbers: 3, 4, 5, 6, 7, or 8.

The most related of all the paired variables were:
1. Fixed.acidity and pH, with a correlation coefficient of -.68
2. Density and Fixed.acidity, with a correlation coefficient of .68
3. Total sulfur dioxide and free sulfur dioxide, with a correlation coefficient of .67, though I believe free sulfur dioxide may be a subset of total sulfur dioxide

The variables most related to Quality were:
1. Alcohol, with a correlation coefficient of .48
2. Sulphates, with a correlation coefficient of .25
3. Citric.Acid, with a correlation coefficient of .23
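
The coefficients reported by ggpairs can also be obtained directly with base R's cor(), which computes Pearson correlations by default. The sketch below uses only the ten rows printed by str(), held in a data frame named `wine_sample` (a name chosen here), so its values will not match the full-sample figures quoted above.

```r
# First ten values of three variables, taken from the str() output.
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
                    0.9978, 0.9964, 0.9946, 0.9968, 0.9978)
)
# Note: str() rounds density to three decimals; the last four density values
# above are illustrative placeholders, not figures from this report.

# Correlation between one pair of variables.
r_acid_ph <- cor(wine_sample$fixed.acidity, wine_sample$pH)
r_acid_ph

# Full correlation matrix, rounded for reading.
cor_matrix <- cor(wine_sample)
round(cor_matrix, 2)
```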

Wow, who knew that the quantity of alcohol alone could potentially explain about 23% of the variation in a taste test result? (A correlation coefficient of .48 squares to roughly .23.)
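
A correlation coefficient is not itself a share of explained variation. For a simple linear relationship between two variables, the share of variance explained is the square of the coefficient. A short worked check using the .48 figure for alcohol and quality:

```r
# Correlation between alcohol and quality, as reported above.
r <- 0.48

# Proportion of variance in quality that a straight-line fit on alcohol
# would account for.
r_squared <- r^2
r_squared  # 0.2304
```

So alcohol alone lines up with roughly 23% of the variation in quality scores, not 48%.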

Multivariate Plots Section

Multivariate Analysis

Treating quality as a categorical variable allows us to see some really interesting scatter plots. We can see beyond the .48 correlation coefficient and say that the mean alcohol levels for 7 and 8 wines were much higher than the rest, though not always. However, 3 and 4 quality wines had a higher mean alcohol than 5 wines, which had a very tight interquartile range compared to the other categories and the lowest mean. Finally, citric acid showed some of the most dramatic differences in means from one category to the next, although its usefulness for prediction was less evident, as we saw more crossing and convergence in the quality trend lines.
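
The per-category means discussed above can be computed by converting quality to a factor and averaging within each level. This is a minimal sketch using only the ten alcohol and quality values printed by str(), so it covers scores 5, 6, and 7 only and will not reproduce the full-sample pattern.

```r
# First ten values of each variable, taken from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Treat the whole-number score as a category rather than a continuous value.
quality_factor <- factor(quality)

# Mean alcohol content within each quality score.
mean_alcohol <- tapply(alcohol, quality_factor, mean)
mean_alcohol
```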


Final Plots and Summary

Plot One

The above plot shows the alcohol content of the wines in our sample across the range of quality scores. There is a clear positive correlation, though not a very strong one. A 7 or 8 quality wine rarely has an alcohol content under 10 percent. I used an alpha of 1/8 to reduce overplotting.
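
The plotting code itself is not included in this report, so the following is only a sketch of how a plot like this could be built with ggplot2. The choice of geoms (jittered points under a box plot), the axis labels, and the data frame name `wine_sample` are assumptions; the alpha of 1/8 and the treatment of quality as a category come from the text. Only the ten rows printed by str() are used.

```r
library(ggplot2)

# First ten alcohol and quality values, taken from the str() output.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# factor(quality) makes ggplot treat the score as discrete categories.
p <- ggplot(wine_sample, aes(x = factor(quality), y = alcohol)) +
  # Mostly transparent points so dense regions show up darker.
  geom_jitter(alpha = 1/8, width = 0.25, height = 0) +
  # Semi-transparent boxes on top; outliers are already drawn as points.
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  labs(x = "Quality score", y = "Alcohol (percent)")

print(p)
```

With only ten points the transparency has little effect; it matters once all 1599 wines are plotted.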

Plot Two

Repeating the box plot for the third most related variable, citric.acid, shows, surprisingly, that wines of quality 7 and 8 have vastly higher amounts of citric acid. If you are trying to pick a top quality wine, 7 and up, this variable is about as important as alcohol, even though the overall correlation coefficients were very different when considering all scores. If I were to use citric acid in a prediction algorithm, I might group it into 3 groups (low, middle, high), since the means for 3 & 4, 5 & 6, and 7 & 8 are very similar to each other within each pair. There are some outliers, and most are in the 7 category.
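
The three-way grouping suggested above could be done with base R's cut(). The cut points are an assumption made for this sketch: it borrows the first and third quartiles of citric.acid from the summary output (0.09 and 0.42) as the boundaries, which is only one of several reasonable choices. The input is the ten citric.acid values printed by str().

```r
# First ten citric.acid values, taken from the str() output.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Hypothetical boundaries: first and third quartiles from the summary output.
# Intervals are closed on the right, so 0.09 itself would count as "low".
citric_group <- cut(
  citric_acid,
  breaks = c(-Inf, 0.09, 0.42, Inf),
  labels = c("low", "middle", "high")
)

table(citric_group)
```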

Plot Three

This graph illustrates the relationship of quality to its two most correlated variables: alcohol and sulphates. It appears that to be an 8 quality score wine, the wine must reside in the upper right quadrant and have high citric acid levels. The 8 wines almost always have an alcohol content above 11, a citric acid level of at least .2, and sulphates of at least .6. 3’s and 4’s are the most diverse, but 5’s typically have a much higher alcohol content, as high or higher than most 8’s, yet they lack in the other factors like sulphates and citric acid (citric acid is not shown above).


Reflection

I was surprised that you could take a subjective score, albeit from trained testers, and have multiple facets of the objective data correlate with it. When I used ggpairs, I saw that about half of the variables had fairly normal distributions. On the other hand, many had a right skew as well. I think it is extremely important to make sure you know whether the data is categorical or continuous.

This data set’s sample size is not huge, and it contains one type of wine from one country. Therefore, we would need more data to create a meaningful prediction model. However, it does show promise for predicting wine quality based on chemical factors given a more expansive sample.

When it comes to quality taste score, the data is clear that alcohol plays a role; however, the significance of the sulphates and citric acid levels is debatable unless you apply it to specific quality categories. One of the issues I encountered was that quality score, although stored as a whole number, was actually a categorical value. So, eventually, I converted the data to categorical. However, after the last review, I was informed that ggplot could take the info and treat it as categorical, thus solving the problem. Additionally, some of the variables seemed to be highly related, especially the following 3:
1. Fixed.acidity and pH, with a correlation coefficient of -.68 (acidity seems to be part of the equation for pH)
2. Density and Fixed.acidity, with a correlation coefficient of .68 (acid appears to be more dense than the rest of the chemical makeup)
3. Total sulfur dioxide and free sulfur dioxide, with a correlation coefficient of .67 (I think free sulfur dioxide is a subset of total sulfur dioxide)

For future studies, I would recommend analyzing price and grouping categories 3-5 into a low quality wine category to reduce noise and increase the correlation for those wines of a 6 and up quality score.
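
The grouping recommended above could be sketched as follows. The label "low" and the variable names are chosen here for illustration, and the input is just one example wine per quality score rather than the real data.

```r
# One example score for each quality level that occurs in the data set.
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Collapse scores 3 to 5 into a single "low" category; keep 6, 7, and 8 as is.
quality_group <- factor(
  ifelse(quality <= 5, "low", as.character(quality)),
  levels = c("low", "6", "7", "8")
)

table(quality_group)
```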